
Nature Biotechnology

Springer Science and Business Media LLC

Preprints posted in the last 30 days, ranked by how well they match Nature Biotechnology's content profile, based on 147 papers previously published here. The average preprint has a 0.34% match score for this journal, so anything above that is already an above-average fit.

1
Structure-Led Exploration of the Metagenome Yields Novel RNA-Guided Nucleases with Broad PAM Diversity

de los Santos, E. L.; Rieber, L.; Wang, M.; Catherman, S.; Hatfield, S.; Bowen, T.

2026-03-29 genomics 10.64898/2026.03.27.714800 medRxiv
Top 0.1%
34.2%

CRISPR-Cas bacterial adaptive immune systems use reprogrammable RNA guide sequences to specifically bind and cleave nucleic acids, and have been repurposed for easy and relatively efficient genomic editing. Despite its widespread use in biomedical research, the large size of Cas9 hinders AAV-mediated therapeutic delivery. Smaller RNA-guided nucleases could improve AAV gene therapy delivery, but their application is limited by their rarity among bacterial genomes and the restrictive sequence preferences of known systems, especially compared to the diversity of PAMs seen in the highly abundant Cas9 systems. Existing methods for identification of novel CRISPR subtypes rely on sequencing ever more bacterial genomes and comparing sequence homology. Using recent advances in protein structure prediction and comparison, we have identified and characterized proteins from known and novel compact RNA-guided nucleases and demonstrated that their PAM preference diversity meets or exceeds that of Cas9 systems or the compact IscB and TnpB systems. This discovery has enabled us to demonstrate editing in eukaryotic cells with multiple novel subtypes, whose compact size, varied PAM sequences, and high specificity make them attractive tools for in vivo genome editing.

2
Sequencing depth overcomes extraction bias: repurposing human WGS data for salivary microbiome profiling

Velo-Suarez, L.; Herzig, A. F.; Bocher, O.; Le Folgoc, G.; Le Roux, L.; Delmas, C.; Zins, M.; Deleuze, J.-F.; Hery-Arnaud, G.; Genin, E.

2026-04-01 genomics 10.64898/2026.03.27.714786 medRxiv
Top 0.1%
32.3%

Large-scale human genomic projects have generated whole-genome sequencing (WGS) data from hundreds of thousands of individuals, primarily to study host genetic variation. When saliva is the DNA source, the resulting datasets also contain microbial reads that are routinely discarded. Here, we investigate whether these host-centric WGS workflows can yield reliable microbiome profiles, effectively doubling the research value of existing data without additional sampling. We compared non-human reads from 39 deeply sequenced saliva samples from the GAZEL cohort (miG dataset; median ~43 million reads/sample) with 14 samples processed with microbiome-optimized extraction (ASAL; median ~4.3 million reads/sample), using two complementary classifiers: meteor, a coverage-based mapper against a curated saliva-specific database, and sylph, a k-mer classifier against the Genome Taxonomy Database (GTDB). Despite the absence of microbial lysis optimization, miG samples showed up to 3-fold higher species richness, ~10-fold greater sequencing depth, and significantly lower inter-sample variability (PERMANOVA R² = 0.10, p = 0.001; BETADISPER p = 0.0036). Rarefaction to 10 reads eliminated most compositional differences, demonstrating that sequencing depth is the primary driver of community stability. Only ~2% of detected taxa (12 of 592) showed extraction-related differences. The two classifiers exhibited fundamentally different depth-sensitivity profiles, with sylph retaining systematic detection asymmetries even after depth normalization, highlighting that classifier choice introduces biases that affect cross-study comparisons. These results show that biobank WGS data from saliva can be repurposed for robust, population-scale oral microbiome analyses, enabling simultaneous investigation of host genomic variation and the microbiome from the same archived samples.
Importance: Saliva-based whole-genome sequencing datasets generated across various cohorts to study human genetics contain non-human reads that are routinely discarded, thereby overlooking valuable microbial information. We show that these reads are sufficient to reconstruct robust oral microbiome profiles -- without any additional sampling or laboratory work. This finding unlocks a vast archive of existing genomic data for retrospective microbiome research, enabling population-scale studies of oral microbial diversity, host-microbiome interactions, and disease associations at minimal additional cost. We further demonstrate that the choice of taxonomic classifier introduces systematic, depth-dependent biases that persist even after normalization, a practical consideration for any cross-cohort or multi-platform microbiome study.
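The rarefaction step mentioned in this abstract can be illustrated with a minimal sketch. It subsamples each sample to a common read depth without replacement, so that richness is compared at equal sequencing effort. The taxon names, read counts, and function names below are made up for illustration and are not taken from the preprint or its pipeline.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=0):
    """Draw `depth` reads without replacement from a taxon -> read-count table."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("sample has fewer reads than the rarefaction depth")
    # Expand the table into one label per read, then subsample the labels.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return Counter(rng.sample(pool, depth))

def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for n in counts.values() if n > 0)

# Hypothetical toy samples: one deeply sequenced, one shallow.
deep = {"Streptococcus": 9000, "Veillonella": 900, "Rothia": 90, "rare_taxon": 10}
shallow = {"Streptococcus": 900, "Veillonella": 90, "Rothia": 10}

# Rarefy both samples to the depth of the shallower one.
depth = min(sum(deep.values()), sum(shallow.values()))
deep_rarefied = rarefy(deep, depth)
shallow_rarefied = rarefy(shallow, depth)
```

After rarefaction both tables hold the same number of reads, so any remaining difference in `richness` is no longer a direct consequence of unequal depth.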

3
Adaptive sampling-based enrichment enables genome reconstruction of intracellular symbionts despite host background and reference divergence

Huang, W.-K.; Yang, C.-H.; Chung, H.; Lee, Y.-C.; Wu, Y.-C.; Chen, Y.-T.; Wan, M.-H.; Yeh, W.-S.; Hong, Y.-P.; Wu, T.-H.; Li, J.-C.; Liu, W.-L.; Chen, C.-H.; Chen, Y.-T.

2026-03-27 genomics 10.64898/2026.03.25.714109 medRxiv
Top 0.1%
28.1%

Recovering genomes of intracellular microbes from host-dominated samples remains a major challenge in microbial genomics, due to low target abundance, overwhelming host DNA, and the inability to culture these organisms independently. Despite extensive interest in Wolbachia, efficient genome recovery directly from host tissues remains limited by the inefficiency of host-dominated sequencing and the constraints of existing enrichment strategies. Here, we demonstrate that Oxford Nanopore adaptive sampling (AS) enables efficient, real-time enrichment of target DNA directly from complex host tissues, providing a culture-free approach for genome recovery in such systems. To our knowledge, this represents the first application of enrichment-mode adaptive sampling to achieve de novo reconstruction of an intracellular endosymbiont genome in a mosquito system. Using Aedes aegypti mosquitoes infected with a locally derived wAlbB-like strain, we applied enrichment-mode AS to selectively sequence Wolbachia DNA. This resulted in an increase from <1% Wolbachia reads in conventional shotgun data to ~90% under adaptive sampling. De novo assembly of AS-enriched long reads yielded a near-complete genome (~1.5 Mb) in two contigs with >96-99% completeness. Comparative analyses revealed multiple large-scale chromosomal rearrangements relative to the reference wAlbB genome, demonstrating that adaptive sampling does not impose reference-dependent genome structure. Annotation further identified three prophage-associated regions, including two strain-specific expansions absent from the reference genome. Notably, cytoplasmic incompatibility genes (cifA and cifB) were identified adjacent to one of these regions, consistent with their known genomic association with prophage elements.
Importantly, adaptive sampling remained effective despite substantial structural divergence between the reference and target genomes, revealing an unexpectedly robust application of this approach beyond its presumed operating conditions. Together, these results establish enrichment-mode adaptive sampling as a robust and scalable strategy for genome-resolved analysis of intracellular bacteria in host-associated systems.

4
Scalable Microbiome Network Inference: Mitigating Sparsity and Computational Bottlenecks in Random Effects Models

Roy, D.; Ghosh, T. S.

2026-03-31 bioinformatics 10.64898/2026.03.27.714858 medRxiv
Top 0.1%
27.8%

The application of Large Language Models (LLMs) and Transformers to biological and healthcare datasets requires the extraction of highly accurate, noise-filtered ecological networks. The Random Effects Model (REM) is a powerful statistical method for inferring microbial interaction networks and identifying keystone species across heterogeneous studies. However, existing implementations in R that rely on single-threaded "Iteratively Reweighted Least Squares" (IRLS) are computationally prohibitive for high-dimensional metagenomic data, creating a significant bottleneck for downstream machine learning pipelines. In this paper, we present Parallel-REM, a highly scalable, Python-based parallel pipeline accelerating large-scale network inference. By integrating robust variance filtering, sparsity checks, and a batched Master-Worker parallelisation strategy using joblib and statsmodels, we resolve native convergence failures associated with sparse biological matrices. Benchmarking on a massive clinical dataset comprising 70,185 samples and 466 optimal species demonstrates a 26.1x speedup over sequential baselines on a 64-core architecture, reducing computation time from days to minutes. Furthermore, statistical validation shows > 99.9% directional concordance with the original R implementation. Parallel-REM democratises large-scale network extraction, providing the high-throughput infrastructure necessary to feed clean, topological and biological features into modern deep learning and Transformer-based diagnostic architectures.

5
Panmap: Scalable phylogeny-guided alignment, genotyping, and placement on pangenomes

Kramer, A. M.; Zhang, A.; Ayala, N.; de Sanctis, B.; Karim, L. M.; Hinrichs, A. S.; Walia, S.; Turakhia, Y.; Corbett-Detig, R.

2026-03-30 bioinformatics 10.64898/2026.03.29.711974 medRxiv
Top 0.1%
23.0%

Pangenomes capture population-level variation but remain computationally challenging at scale. We present Panmap, a tool that leverages evolutionary structure to place, align, and genotype sequencing reads against mutation-annotated pangenomes containing up to millions of genomes. Panmap introduces a phylogenetically compressed k-mer index that stores only sequence differences along branches, enabling efficient comparison of reads to both sampled genomes and inferred ancestors. This approach reduces index size by up to 600-fold and construction time by over three orders of magnitude relative to existing tools. Panmap places a 100x coverage SARS-CoV-2 sample onto 20,000 genomes in 0.4 seconds and onto 8 million genomes in under two minutes. Furthermore, it enables accurate haplotype identification and abundance estimation in metagenomic samples and sensitive placement of ancient environmental DNA without prior alignment. Our approach makes large-scale pangenomes directly amenable to read mapping, genome assembly, alignment-free phylogenetic placement, and metagenomic analysis.
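The idea of an index that stores only differences along branches can be sketched with a toy example. This is not Panmap's data structure or API: the class `DeltaIndex`, its methods, and the sequences are hypothetical. Each node keeps only the k-mers gained and lost relative to its parent, and a node's full k-mer set is rebuilt by replaying those deltas from the root.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

class DeltaIndex:
    """Toy tree index: each node stores only k-mer gains/losses versus its parent."""

    def __init__(self, k):
        self.k = k
        self.parent = {}
        self.gained = {}
        self.lost = {}

    def add_node(self, name, seq, parent=None):
        node_kmers = kmers(seq, self.k)
        base = self.kmers_at(parent) if parent is not None else set()
        self.parent[name] = parent
        self.gained[name] = node_kmers - base
        self.lost[name] = base - node_kmers

    def kmers_at(self, name):
        # Collect the path to the root, then replay deltas from the root downward.
        path = []
        while name is not None:
            path.append(name)
            name = self.parent[name]
        result = set()
        for node in reversed(path):
            result -= self.lost[node]
            result |= self.gained[node]
        return result

    def place(self, read):
        # Place a read on the node (sample or ancestor) sharing the most k-mers.
        read_kmers = kmers(read, self.k)
        return max(self.parent, key=lambda n: len(read_kmers & self.kmers_at(n)))

idx = DeltaIndex(k=4)
idx.add_node("root", "ACGTACGTTAGC")           # stands in for an inferred ancestor
idx.add_node("A", "ACGTACGATAGC", parent="root")  # one substitution vs root
idx.add_node("B", "ACGTACGTTAGG", parent="root")  # a different substitution
placement = idx.place("CGATAG")                # read covering A's substitution
```

In this toy, node A stores 7 k-mer changes rather than its full set of 9 k-mers; on real trees with closely related genomes the savings from storing only branch differences would be far larger.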

6
GraphBG: Fast Bayesian Domain Detection via Spectral Graph Convolutions for Multi-slice and Multi-modal Spatial Transcriptomics

Do, V. H.; Tran, T. P. L.; Canzar, S.

2026-03-31 bioinformatics 10.64898/2026.03.28.715026 medRxiv
Top 0.1%
22.9%

Spatial transcriptomics (ST) technologies enable measurement of gene expression with spatial context, offering unprecedented insight into tissue architecture and cellular microenvironments. A fundamental analysis task is the identification of spatial domains, i.e., contiguous regions with distinct molecular profiles. As ST datasets scale to larger tissue areas, multiple slices, and multiple molecular modalities, there is a growing need for clustering methods that are accurate, scalable, and capable of integrating diverse spatial and molecular signals. We present GraphBG, a unified and scalable framework for spatial domain detection in ST data. GraphBG integrates approximate spectral graph convolutions with a variational Bayesian Gaussian mixture model, enabling robust representation learning and clustering of spatially coherent domains. We extend this core model to support multi-slice analysis (GraphBG-MS) through metacell aggregation, batch correction, and joint clustering, and to multi-modal spatial omics data (GraphBG-MM) via modality-specific graph encodings and kernel canonical correlation analysis. Across diverse real and simulated datasets, GraphBG consistently outperforms existing methods in domain coherence, scalability, and biological interpretability. Notably, it accurately clusters over 370,000 cells from 31 MERFISH tissue slices in just 5 minutes and integrates spatial transcriptomic and proteomic data for improved domain resolution. Applying GraphBG-MS to mouse liver ST data, we show that it captures canonical lobular zonation and disease-specific remodeling, highlighting its ability to reveal biologically meaningful tissue organization.

7
Epigenomic methylome landscape of promoters in vertebrate genomes

Lee, Y. H.; Lee, C.; Jarvis, E.; Kim, H.

2026-03-30 bioinformatics 10.64898/2026.03.29.715150 medRxiv
Top 0.1%
22.7%

Genomic promoters are crucial gene regulatory elements [1,2]. Yet, comparative analyses of promoter architecture have been constrained by the limited resolution of GC-rich regions in short-read-based genome resources [3-6]. The Vertebrate Genomes Project (VGP) provides more complete long-read-based assemblies [7], which further detect 5-methylcytosine signals directly from PacBio HiFi circular consensus reads [8,9]. Here, we developed a scalable computational framework to characterize DNA methylomes from HiFi data on high-quality Phase I VGP assemblies with RefSeq gene annotations for 82 vertebrate species spanning seven major taxonomic classes: mammals, birds, reptiles, amphibians, lobe-finned fishes, ray-finned fishes, and cartilaginous fishes. We observed a conserved, transcription start site-centered hypomethylation signature in promoters across all vertebrates, and an unexpected hypermethylation signature near gene boundaries that is discordant with transcripts. In addition to this conserved pattern, there were lineage-specific differences in promoter methylation profiles, with birds showing the most diverse patterns. These epigenetic landscapes track phylogenetic relationships more closely than tissue-type methylation differences and infer lineage-dependent widths of core promoters and broader promoters across major vertebrate classes. Our findings establish a comparative epigenomic framework for profiling promoter methylomes from long-read sequencing data.

8
The U-method: Leveraging expression probability for robust biological marker detection

Stein, Y.; Lavon, H.; Hindi Malowany, M.; Arpinati, L.; Scherz-Shouval, R.

2026-04-02 bioinformatics 10.64898/2026.03.31.715470 medRxiv
Top 0.1%
22.7%

Reliable identification of cluster-defining markers is fundamental to single-cell transcriptomic analysis, yet current approaches often rely on average expression differences, which can dilute biologically informative signals in sparse and heterogeneous data. Here we introduce the U-method, a fast probability-based framework for identifying uniquely expressed genes (UEGs) by contrasting a gene's expression probability within a cluster with its highest expression probability in any other cluster. This highest-probability comparison prioritizes detection consistency over expression magnitude, resulting in markers that consistently identify cell populations across independent datasets analyzed at comparable clustering resolutions. Applied to colorectal, breast, pancreatic, and lung cancer single-cell RNA-sequencing datasets, the U-method identifies canonical lineage markers together with additional genes showing clear cluster specificity. When projected onto Visium HD spatial transcriptomics data using only raw average expression of top UEGs, these signatures reveal coherent and biologically interpretable tissue organization without the need for smoothing, deconvolution, or model-based spatial inference. These results position the U-method as a practical implementation of detection consistency, enabling robust marker discovery and spatial interpretation in single-cell analysis.
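The contrast described in this abstract can be sketched as follows. The sketch assumes that "expression probability" means the fraction of cells in a cluster with a non-zero count, and that the contrast is a simple difference; the abstract does not state the exact form, so both are assumptions. The function names and the toy counts are hypothetical.

```python
def expression_probability(cells):
    """Fraction of cells in which each gene is detected (count > 0)."""
    n = len(cells)
    genes = {g for cell in cells for g in cell}
    return {g: sum(1 for cell in cells if cell.get(g, 0) > 0) / n for g in genes}

def u_scores(clusters):
    """Per cluster and gene: in-cluster probability minus the highest
    probability of that gene in any other cluster (assumed contrast)."""
    probs = {name: expression_probability(cells) for name, cells in clusters.items()}
    scores = {}
    for name, p_in in probs.items():
        scores[name] = {}
        for gene, p in p_in.items():
            p_other = max(
                (probs[other].get(gene, 0.0) for other in probs if other != name),
                default=0.0,
            )
            scores[name][gene] = p - p_other
    return scores

# Toy data: each cell is a gene -> count mapping (values are made up).
clusters = {
    "epithelial": [
        {"EPCAM": 5, "ACTB": 9},
        {"EPCAM": 1, "ACTB": 7},
        {"EPCAM": 2},
    ],
    "fibroblast": [
        {"COL1A1": 3, "ACTB": 8},
        {"COL1A1": 1, "ACTB": 2, "EPCAM": 40},
    ],
}
scores = u_scores(clusters)
```

Here EPCAM scores 0.5 in the epithelial cluster (detected in all three cells there, versus one of two fibroblast cells), and the single fibroblast cell with a count of 40 does not change that score, which reflects the emphasis on detection consistency over expression magnitude.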

9
LoRTIA Plus: a chemistry-agnostic, feature-first software package for long-read transcriptome annotation

Torma, G.; Balazs, Z.; Fulop, A.; Tombacz, D.; Boldogkoi, Z.

2026-04-04 genomics 10.64898/2026.04.03.716279 medRxiv
Top 0.1%
22.6%

Long-read RNA sequencing (lrRNA-seq) enables direct reconstruction of full-length transcripts, yet existing annotation tools show variable performance across genomes and library chemistries, particularly for novel isoforms. We present LoRTIA Plus, a chemistry-agnostic suite for transcriptome annotation and reconstruction from lrRNA-seq data. LoRTIA Plus first detects and filters transcription start sites (TSSs), transcription end sites (TESs), and introns using adapter-aware and quality-based criteria, and evaluates read support before assembling high-confidence transcript models. We benchmarked LoRTIA Plus against bambu, FLAIR, IsoQuant, and NAGATA on KSHV transcriptomes with dense overlap, using a validated literature-supported boundary set, and on transcriptomes from three human cell lines from the Long Read RNA-seq Genome Annotation Assessment Project (LRGASP) sequenced with five long-read chemistries. On KSHV, LoRTIA Plus achieved the highest F1 scores for TSSs, TESs, and transcripts in both direct-cDNA and direct-RNA datasets by improving recall without sacrificing precision. Across human datasets, LoRTIA Plus consistently ranked among the top boundary annotators across all chemistries and was the best-performing tool in PCR-based libraries, while remaining highly competitive on native RNA. Junction- and isoform-level analyses show that LoRTIA Plus yields a rich, reproducible repertoire of novel isoforms and transcript boundaries from viral to human transcriptomes.

10
SNMF: Ultrafast, Spatially-Aware Deconvolution for Spatial Transcriptomics

Alonso, L.; Ochoa, I.; Rubio, A.

2026-03-19 bioinformatics 10.64898/2026.03.17.712043 medRxiv
Top 0.1%
22.6%

Sequencing-based spatial transcriptomics has revolutionized the study of tissue architecture, but its spots often contain multiple cells, creating a key computational challenge, termed deconvolution, to decipher each spot's cell-type composition. Reference-free deconvolution methods avoid the need for a matched single-cell RNA-seq dataset, but typically neglect the spatial correlation between neighboring spots and do not leverage modern hardware for efficient computation. Here, we propose SNMF (Spatial Non-negative Matrix Factorization): a rapid, accurate, and reference-free deconvolution method. SNMF extends the standard NMF framework with a spatial mixing matrix that models neighborhood influences, guiding the factorization toward spatially coherent solutions. Our R package is, to our knowledge, the first spatial transcriptomics deconvolution tool to natively support GPU execution, completing benchmark analyses in under one minute (over two orders of magnitude faster than the slowest competing methods) with moderate memory requirements. On synthetic and real benchmark datasets, SNMF significantly outperforms state-of-the-art methods in deconvolution accuracy, and on a human melanoma dataset it recovers biologically meaningful cell-type signatures, including a tumor-boundary transition zone, without any reference input. The proposed method is publicly available at https://github.com/ML4BM-Lab/SNMF.

11
Allos: an integrated Python toolkit for isoform-level single-cell and spatial in-situ transcriptomics

Mcandrew, E.; Diamant, A.; Vassaux, G.; Barbry, P.; Lebrigand, K.

2026-03-26 bioinformatics 10.64898/2026.03.24.713944 medRxiv
Top 0.1%
22.5%

Single-cell RNA sequencing and spatial transcriptomics have transformed our understanding of the transcriptional landscape by enabling high-resolution profiling of gene expression. Yet most experimental pipelines and their associated analysis frameworks collapse transcript diversity into gene-level counts, obscuring alternative splicing and isoform usage. The increasing ability of long-read sequencing to recover full-length transcripts from single cells and spatially barcoded tissues has created a pressing need for computational frameworks to support the storage, analysis, and visualisation of isoform-resolved data. Existing tools for isoform and splicing analysis specialise in bulk, single-cell, or spatial RNA-seq assays in isolation and remain fragmented across languages and data models, limiting interoperability and hindering widespread adoption. We present Allos, a Python framework for isoform-level single-cell and spatial transcriptomics analysis. Built on the AnnData data model, Allos natively represents transcript-level quantification and integrates directly with GTF/GFF and FASTA annotations. Allos enables differential isoform usage screening, multi-panel visualisation, structural transcript interpretation, and protein-level analysis across bulk, single-cell, and spatial assays from both long- and short-read sequencing. Its modular design and scverse compatibility allow isoform-resolved analyses to run alongside established gene-level workflows, linking transcript-level screening with structure-aware visualisation and downstream interpretation. Allos is open-source and available at https://github.com/cobioda/allos, with comprehensive documentation and tutorials provided online.

12
CleanFinder: A Scalable Framework for Comprehensive Genome Editing Analysis

Ramachandran, H.; Dobner, J.; Nguyen, T.; Binder, S.; Tolle, I.; Vykhlyantseva, I.; Krutmann, J.; Miccio, A.; Staerk, C.; Brusson, M.; Kontarakis, Z.; Prigione, A.; Rossi, A.

2026-03-25 genetics 10.1101/2025.10.23.684080 medRxiv
Top 0.1%
22.4%

Precise validation of genome editing by targeted sequencing is a critical, multi-step process. Existing tools often separate amplicon definition from data analysis, creating fragmentation and added complexity. We developed CleanFinder, a browser-native application that unifies these steps. Based on user-provided sgRNAs or primers, CleanFinder retrieves the corresponding genomic context, automatically defines an amplicon, and sets robust sequence anchors. These anchors then guide alignment of sequencing reads, enabling accurate quantification of editing outcomes without relying on static, pre-loaded genome databases. Its analytical engine performs a comprehensive assessment of the sequencing data: it automates the classification of reads into key functional categories while simultaneously identifying heterozygous Single Nucleotide Polymorphisms (SNPs) to enable direct assessment of allelic dropout. To provide crucial biological context, the tool incorporates an interactive gene viewer that maps sgRNA targets and visualizes transcript-specific coding sequences, protein translations, and overall gene structure. Importantly, CleanFinder operates entirely client-side, ensuring complete data privacy as genomic information is never uploaded and no installation is required. By integrating these advanced analytical and visualization capabilities into a secure, all-in-one solution, CleanFinder makes robust genome editing analysis accessible to any researcher, regardless of their bioinformatics expertise.

13
OxBreaker: species-agnostic pipeline for the analysis of outbreaks using nanopore sequencing

Reding, C.; Hopkins, K. M. V.; Colpus, M.; Sanderson, N. D.; Gentry, J.; Oakley, S.; Campbell, M.; Karageorgopoulos, D.; Jeffery, K. J. M.; Eyre, D. W.; Bejon, P.; Stoesser, N.; Walker, A. S.; Young, B. C.

2026-03-19 genomics 10.64898/2026.03.18.709804 medRxiv
Top 0.1%
22.4%

Real-time genomic surveillance may mitigate the spread of health-care-associated infections, but whole-genome sequencing costs and the need for specialised expertise constrain its wide implementation in public health. Here we present OxBreaker, an automated and species-agnostic pipeline optimised for the high-resolution analysis of bacterial and plasmid genomes sequenced via Oxford Nanopore Technologies (ONT). OxBreaker streamlines the transition from raw reads to phylogenetic inference through automated reference selection and high-accuracy variant calling. It is accessible via a graphical user interface (GUI) that can be easily installed locally and operated by non-specialists. Benchmarking against technical and biological replicates of high-priority pathogens demonstrates high accuracy, with false positive variant rates reduced to 0-4 single-nucleotide polymorphisms (SNPs) for common species. We further validated the pipeline by accurately characterising previously published clonal and plasmid-mediated outbreaks, reproducing established phylogenies with improved accessibility. By providing a stable, scalable, open-source offline-compatible solution that matches the resolution of short-read platforms while maintaining the speed of long-read technology, OxBreaker is designed to facilitate the adoption of local, real-time genomic surveillance for frontline infection prevention and control.

14
Novel Engineered AAV Variants Demonstrate Superior Blood-Brain Barrier Penetration and Safety in Non-Human Primates

Wang, Z.; Li, H.; Xu, X.; Sun, Z.; He, R.; Zhang, L.; Yu, M.; Wang, S.; Hu, C.; Liu, L.; Ren, L.; Xu, Y.; Xiao, T.; Li, D.; Sun, B.; Luo, Y.; An, Z.

2026-04-01 neuroscience 10.64898/2026.03.29.713052 medRxiv
Top 0.1%
22.3%

Systemic delivery of adeno-associated virus (AAV) for gene therapy of central nervous system (CNS) disorders is limited by inefficient blood-brain barrier (BBB) penetration and dose-limiting toxicity in peripheral organs, notably the liver and dorsal root ganglia (DRG) [1-5]. Here, we report the development of novel AAV variants via a proprietary capsid engineering platform (REACH). In non-human primates (NHPs), intravenous administration of lead variants resulted in transgene expression levels in the brain that were 600- to 2,000-fold higher than AAV9 at the RNA level, concomitant with a 10- to 50-fold reduction in liver tropism and minimal off-target exposure in the heart and DRG. These engineered capsids achieve unprecedented, pan-CNS transduction with a markedly improved safety profile, representing a transformative platform for treating a broad spectrum of neurological diseases.

15
A rapid, sensitive, and quantitative high plex biomarker digital detection platform enabled by Hypercoding

Bathina, M.; Blum, A. P.; Brodin, J.; DeBuono, N.; Fu, Y.; Lu, B.; Naticchia, M. R.; Ortiz, D.; Richards, A.; Rozieres, C. d.; Schowalter, R.; Shultzaberger, S.; Snow, S.; Tanner, S.; Trejo, C. L.; Ward, S.; LeCoultre, R.; Read, K.; Sathe, S.; Schlegel, C.; Schlegel, I.; Shaner, S.; Tsay, J.; Weir, J.; Wong, K. M.; Abi-Samra, K.; Alldredge, J.; Anderson, P.; Bailey, J.; Bollig, C.; Bonnardel, J.; Bru, A.; Chan, A. C. S.; Chang, T.; DeBerg, L.; Doorn, J. v.; Driscoll, P.; Duarte, T.; Esparza, A.; Frerichs, D.; Gautherot, A.; Held, L.; Hendricks, G.; Holst, G.; Iwamoto, K.; Jimenez, H.; Khandan,

2026-03-25 genomics 10.64898/2026.03.23.711448 medRxiv
Top 0.1%
22.3%

Low-cost, multiplexed, and automated assays are needed to make omic technologies more broadly accessible in clinical, research, and commercial settings. We present Hypercoding, a scalable technology for detection and quantitation of multi-omic targets. Drawing from data reliability methods in the telecommunications field, Hypercoding uses fluorescent signals from hybridization with an error-correcting code to enable detection of high-plexity targets from biological samples, such as human DNA. In the presence of the target, a linear DNA construct is circularized, immobilized, and amplified to enable single-molecule detection of a target via rapid readout cycles within a 96-well plate. We demonstrate capability for >10,000 code plexity and accurate (98.7%) genotyping of 209 pharmacogenomic variants. Furthermore, we show computation of copy number variation with whole chromosome and sub-gene resolution, as well as quantitation of target abundance down to 10 fM sensitivity with a dynamic range of up to 10 logs.
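How an error-correcting code tolerates a misread readout cycle can be illustrated with a generic nearest-codeword decoder. This is not the Hypercoding scheme itself, whose code design the abstract does not describe. The codebook, the colour letters, the target names, and the `max_errors` parameter are all hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(observed, codebook, max_errors=1):
    """Return the target whose codeword is nearest to the observed readout,
    or None if the nearest codeword is too far away or tied with another."""
    ranked = sorted((hamming(observed, cw), target) for target, cw in codebook.items())
    best_dist, best_target = ranked[0]
    if best_dist > max_errors:
        return None
    if len(ranked) > 1 and ranked[1][0] == best_dist:
        return None
    return best_target

# Hypothetical codebook: one colour letter per readout cycle, five cycles.
# Every pair of codewords differs at all five positions, so a single
# misread cycle still leaves the true codeword as the unique nearest one.
codebook = {
    "variant_A": "RGBYR",
    "variant_B": "GBRRY",
    "variant_C": "BYGGB",
}

exact = decode("GBRRY", codebook)      # clean readout of variant_B
corrected = decode("RGBYG", codebook)  # variant_A with the last cycle misread
rejected = decode("RRRRR", codebook)   # too far from every codeword
```

The design choice shown is the usual trade-off: spacing codewords far apart spends readout cycles on redundancy, in exchange for calls that survive individual cycle errors and for the ability to reject readouts that match nothing.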

16
STAPLE: automating spatial transcriptomics analysis and AI interpretation

Lvovs, D.; Quinn, J.; Forjaz, A.; Santana-Cruz, I.; Stapleton, O.; Vavikolanu, K.; Wetzel, M.; Data Science Hub TeamLab; Demystifying Pancreatic Cancer Therapies TeamLab; Pagan, V. B.; Herb, B. R.; Favorov, A.; Kagohara, L. T.; Kiemen, A. L.; Maitra, A.; Sidiropoulos, D. N.; Tansey, W.; Wood, L.; Deshpande, A.; Noble, M.; Fertig, E. J.

2026-04-01 bioinformatics 10.64898/2026.03.30.715127 medRxiv
Top 0.1%
22.2%

Spatial transcriptomics workflows often span separate tools for cell typing, neighborhoods, and cell-cell communication, yielding fragmented outputs that hinder scalability, interpretation, and reproducibility. STAPLE systematizes analyses across distinct methods into a modular framework, unifying data structures and cross-tool interoperability. End-to-end analyses are performed unassisted with a single invocation, fostering rigorous, reproducible spatial transcriptomics analysis. Its novel, AI-enabled reporting layer synthesizes quantitative results into summaries of biological findings, facilitating analysis interpretation.

17
DenMark: A Bayesian Hierarchical Model for Identifying Cell-Density Correlated Genes from Spatial Transcriptomics

Xu, M.; Schmidt, A.; Zhang, Q.

2026-04-04 genomics 10.64898/2026.04.02.713482 medRxiv
Top 0.1%
22.1%

Recent advances in single-cell-resolution spatial transcriptomics (ST) enable the profiling of gene expression while preserving the precise locations of individual cells, enabling quantitative investigation of how cellular organization relates to molecular state. A fundamental yet under-modeled aspect of organization is local cell density, which varies across microenvironments and can be linked to transcriptional programs. However, rigorous computational frameworks to quantify density-expression correlations remain lacking. Here, we present DenMark (Density-dependent Marked point process framework), a unified statistical framework that jointly models local cell locations and gene expression in single-cell-resolution ST data, enabling identification of density-correlated genes while naturally providing uncertainty quantification. To scale inference, DenMark leverages a Hilbert space Gaussian process approximation. In simulations, DenMark provides an accurate and well-calibrated estimate of density-expression association. Across single-cell ST platforms, including MERFISH and 10x Xenium, and across brain and cancer tissues, DenMark identifies genes whose expression is associated with cellular clustering and reveals density-related biological programs.

18
VPF-Class 2.0: a taxonomy-centered framework for automatic viral classification

Vidal, L. J.; Pons, J. C.; Fiamenghi, M. B.; Kyrpides, N.; Llabres, M.

2026-03-23 bioinformatics 10.64898/2026.03.20.713201 medRxiv
Top 0.1%
22.0%

Rapid expansion of viral sequence data demands classifiers that scale, track ICTV updates, and provide interpretable evidence. We present VPF-Class 2.0, an updated successor to VPF-Class centred on taxonomic classification, which retains marker-driven protein domain detection but replaces rule-based voting with a lightweight supervised model on per-genome marker-composition features. In controlled benchmarks, VPF-Class 2.0 achieves near-perfect family-level performance and strong genus-level accuracy while increasing confident annotation coverage. Under a practical confidence threshold (0.3), performance improves and matches or exceeds representative tools within shared taxonomic scopes. We further introduce an interpretability study that relates errors to the genus specificity of activated markers. Finally, we demonstrate applicability on large real-world viromes with consistent labels and substantial agreement with graph-based classifications. The implementation of VPF-Class 2.0 can be downloaded from https://github.com/luisvidalj/VPFClass2.git.

19
Baktfold: Sensitive protein functional annotation across the microbial tree of life using structural information

Bouras, G.; Lim, S. w.; Durr, L.; Vreugde, S.; Goesmann, A.; Edwards, R. A.; Schwengers, O.

2026-04-01 bioinformatics 10.64898/2026.03.31.715528 medRxiv
Top 0.1%
22.0%

The functional annotation of protein sequences has progressed tremendously in recent years, but too many protein sequences still remain so-called hypothetical proteins after state-of-the-art genome annotation pipelines have been applied. Here, we introduce Baktfold, a new command line software tool for the ultra-sensitive, taxon-independent, and fast annotation of protein sequences across the microbial tree of life. Baktfold conducts sequential protein structure-based searches against four complementary structure databases. Protein sequences are transformed into Foldseek 3Di tokens via the ProstT5 protein language model and subsequently searched against structure databases via Foldseek. All results are exported as GFF3 and INSDC-compliant flat files, as well as comprehensive JSON files that facilitate automated downstream analysis and are 100% interoperable with the popular bacterial annotation tool Bakta. We assessed Baktfold's performance in terms of wall-clock runtime and functional annotation of hypothetical proteins from various sources, including bacterial and archaeal isolates, plasmids, metagenome-assembled genomes, and micro-eukaryotes. When benchmarked on over three hundred thousand species representatives across the prokaryotic tree of life, Baktfold's median overall bacterial genome annotation rate is 87.8%, compared to 72.9% with Bakta, while Baktfold's median bacterial annotation rate of the remaining hypothetical proteins is 50.1% (n=290258). For archaea, Baktfold's overall median annotation rate is 71.5%, compared to Prokka's 35.8%, with a median archaeal annotation rate of hypothetical proteins of 68.0% (n=14058), making Baktfold the most sensitive automated archaeal annotation method by far. Baktfold is implemented in Python 3 and runs on macOS and Linux systems. It is freely available under an MIT license at https://github.com/gbouras13/baktfold.
Data Summary
- Baktfold was developed in Python as a command line application for Linux and macOS.
- The complete source code and documentation are available on GitHub under an MIT license: https://github.com/gbouras13/baktfold
- The Baktfold database is hosted at Zenodo (https://zenodo.org/records/17347516) and mirrored on HuggingFace (https://huggingface.co/datasets/gbouras13/baktfold-db).
- Baktfold is available via bioconda (https://anaconda.org/bioconda/baktfold) and PyPI (https://pypi.org/project/baktfold/).
- Baktfold can also be run without local installation using Google Colab at https://colab.research.google.com/github/gbouras13/baktfold/blob/main/run_baktfold.ipynb
- All supplementary code, data, and files required to reproduce the results of this manuscript are available at https://github.com/gbouras13/baktfold-analysis (code and small data) and https://zenodo.org/records/19333697 (large data).

20
General-purpose embeddings for long-read metagenomic sequences via β-VAE on multi-scale k-mer frequencies

Nielsen, T. N.; Lui, L. M.

2026-03-23 bioinformatics 10.64898/2026.03.19.713080 medRxiv
Top 0.1%
22.0%

Long-read metagenomics routinely produces millions of assembled contigs, creating a need for methods that organize sequences into biologically meaningful groups across samples and environments. We present a general-purpose compositional embedding for metagenomic sequences based on a β-variational autoencoder (β-VAE) trained on multi-scale k-mer frequencies (1-mers through 6-mers; 2,772 features with centered log-ratio transformation). The embedding compresses each contig into a 384-dimensional vector that preserves local compositional similarity, enabling similarity search and graph-based clustering from sequence composition alone. Through systematic comparison of fifteen models trained on up to 17.4 million contigs (525.5 Gbp) from brackish, terrestrial, and reference genome sources, we find that a small set of curated prokaryotic reference genomes (656,000 contigs) outperforms ten-fold larger domain-specific training sets, and that neither reconstruction loss nor Spearman correlation reliably predicts downstream clustering quality. On nearest-neighbor graphs, flow-based clustering (MCL) markedly outperforms modularity-based methods (Leiden), yielding 12,123 clusters from 154,041 contigs (≥ 100 kbp) with 99.2% phylum-level purity confirmed by independent marker gene phylogenetics. Multi-method taxonomic annotation achieves 87% coverage and reveals that 16.4% of contigs are eukaryotic, the single largest component invisible to standard prokaryotic annotation tools. The embedding provides a sample-independent coordinate system for organizing metagenomic sequence space at scale.
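
The input features described in this abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes that k-mers are collapsed with their reverse complements (canonical k-mers), which the abstract does not state but which reproduces the quoted total of 2,772 features for k = 1 to 6. The pseudocount, the choice to apply the centered log-ratio (CLR) transform separately within each k, and the function names `canonical_kmers` and `clr_kmer_features` are all assumptions made for illustration.

```python
from itertools import product
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(k):
    """Sorted list of all canonical k-mers of length k (assumed feature set)."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def clr_kmer_features(seq, max_k=6, pseudocount=1.0):
    """Multi-scale canonical k-mer frequencies, CLR-transformed per k.

    The pseudocount and the per-k CLR are illustrative assumptions.
    """
    seq = seq.upper()
    vector = []
    for k in range(1, max_k + 1):
        names = canonical_kmers(k)
        counts = dict.fromkeys(names, pseudocount)
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
        total = sum(counts.values())
        logs = [math.log(counts[name] / total) for name in names]
        mean_log = sum(logs) / len(logs)
        # CLR: log frequency minus the mean log frequency within this k.
        vector.extend(value - mean_log for value in logs)
    return vector


n_features = sum(len(canonical_kmers(k)) for k in range(1, 7))
features = clr_kmer_features("ACGTTGCAACGTNNACGGTTAGC")
print(n_features, len(features))
```

Under the canonical-k-mer assumption the per-k feature counts are 2, 10, 32, 136, 512, and 2,080, which sum to 2,772. A vector of this kind would be the input to the β-VAE encoder; the encoder itself is not sketched here.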